The WDC Gold Standards for Product Feature Extraction and Product Matching
Abstract
Finding out which e-shops offer a specific product is a central challenge for building integrated product catalogs and comparison shopping portals. Determining whether two offers refer to the same product involves extracting a set of features (product attributes) from the web pages containing the offers and comparing these features using a matching function. The existing gold standards for product matching have two shortcomings: (i) they only contain offers from a small number of e-shops and thus do not properly cover the heterogeneity that is found on the Web, and (ii) they only provide a small number of generic product attributes and therefore cannot be used to evaluate whether detailed product attributes have been correctly extracted from textual product descriptions. To overcome these shortcomings, we have created two public gold standards. The WDC Product Feature Extraction Gold Standard consists of over 500 product web pages originating from 32 different websites, on which we have annotated all product attributes (338 distinct attributes) that appear in product titles, product descriptions, as well as tables and lists. The WDC Product Matching Gold Standard consists of over 75 000 correspondences between 150 products (mobile phones, TVs, and headphones) in a central catalog and offers for these products on the 32 websites. To verify that the gold standards are challenging enough, we ran several baseline feature extraction and matching methods, resulting in F-score values in the range 0.39 to 0.67. In addition to the gold standards, we also provide a corpus consisting of 13 million product pages from the same websites, which may be useful as background knowledge for training feature extraction and matching methods.
Similar resources
Comparative Cyanide and Thiourea Extraction of Gold Based on Characterization Studies (TECHNICAL NOTE)
This study involved preliminary laboratory test work to identify the relative leaching response to cyanidation and thiourea leaching of an oxidized low grade gold ore based on characterization studies. Huge reserves of gold deposits have been reported from different parts of Iran, especially at Neishaboor area with gold grade of approx 4 ppm. For the mineralogical composition, the nature of the...
Contourlet-Based Edge Extraction for Image Registration
Image registration is a crucial step in most image processing tasks for which the final result is achieved from a combination of various resources. In general, the majority of registration methods consist of the following four steps: feature extraction, feature matching, transform modeling, and finally image resampling. As the accuracy of a registration process is highly dependent to the fe...
A Machine Learning Approach for Product Matching and Categorization
Consumers today have the option to purchase products from thousands of e-shops. However, the completeness of the product specifications and the taxonomies used for organizing the products differ across different e-shops. To improve the consumer experience, e.g., by allowing for easily comparing offers by different vendors, approaches for product integration on the Web are needed. In this paper,...
The WeSearch Corpus, Treebank, and Treecache - A Comprehensive Sample of User-Generated Content
We present the WeSearch Data Collection (WDC)—a freely redistributable, partly annotated, comprehensive sample of User-Generated Content. The WDC contains data extracted from a range of genres of varying formality (user forums, product review sites, blogs and Wikipedia) and covers two different domains (NLP and Linux). In this article, we describe the data selection and extraction process, with...
Enriching Product Ads with Metadata from HTML Annotations
Product ads are a popular form of search advertizing offered by major search engines, including Yahoo, Google and Bing. Unlike traditional search ads, product ads include structured product specifications, which allow search engine providers to perform better keyword-based ad retrieval. However, the level of completeness of the product specifications varies and strongly influences the performan...